Part 2: Exploratory Data Analysis


This section is dedicated to understanding the data. We analyze the data set visually in order to summarize its main characteristics.

We will analyze some basic statistics for each variable. To do this, we first need to convert the variable Date to a date format.
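As a minimal sketch of such a conversion, written in plain Python purely for illustration (the report's own analysis relies on R), and assuming the dates are stored as ISO `YYYY-MM-DD` strings like the `2014-01-21` entries printed in this report:

```python
from datetime import date

# Hypothetical raw values; the ISO layout is an assumption based on
# the dates printed in this report.
raw_dates = ["2014-01-21", "2014-01-21"]

# date.fromisoformat raises ValueError on malformed input,
# which surfaces badly formatted entries early.
parsed = [date.fromisoformat(s) for s in raw_dates]

print(parsed[0].year, parsed[0].month, parsed[0].day)
```

Once parsed, the values can be sorted, compared or grouped by year and month like any other date object.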

Below, we look at the overall structure of the data set and check whether there are any missing values.

rows  columns  all_missing_columns  total_missing_values  complete_rows  total_observations
 430       34                    0                     2            428               14620


There are 2 missing values, both in the feature Glu_P: one at instance 377 and the other at instance 378.

    Country Mountain_range Locality         Plot Subplot Date       Glu_P
377 Chile   Central Andes  Baños de Colinas 76   2       2014-01-21 NA
378 Chile   Central Andes  Baños de Colinas 76   3       2014-01-21 NA
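The summary figures above can be reproduced in principle with a few lines of code. The sketch below is plain Python for illustration only (the report itself was produced in R). The small table is hypothetical, and only the last computation uses figures reported here (430 rows, 34 columns, 2 missing values).

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing (an assumption about the encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_summary(rows):
    """Summarize missing cells in a list of dicts sharing the same keys."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    missing_cells = [(i, key) for i, row in enumerate(rows)
                     for key, value in row.items() if is_missing(value)]
    incomplete_rows = {i for i, _ in missing_cells}
    return {
        "rows": n_rows,
        "columns": n_cols,
        "total_missing_values": len(missing_cells),
        "complete_rows": n_rows - len(incomplete_rows),
        "total_observations": n_rows * n_cols,
        "missing_cells": missing_cells,
    }

# Hypothetical miniature table, not the real data.
toy = [
    {"Plot": 76, "Subplot": 1, "Glu_P": 1.9},
    {"Plot": 76, "Subplot": 2, "Glu_P": None},
    {"Plot": 76, "Subplot": 3, "Glu_P": float("nan")},
    {"Plot": 77, "Subplot": 1, "Glu_P": 2.4},
]
summary = missing_summary(toy)
print(summary)

# Share of missing cells using the figures reported in this section.
reported_share = 2 / (430 * 34) * 100
print(f"{reported_share:.3f}% of all cells are missing")
```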


To understand how the data set is composed, we use the following graph:


Continuous columns predominate over discrete columns. Most of our models will use both discrete and continuous features. We can also see that missing observations represent only 0.014% of the total number of observations, which at first sight makes this a good data set.

We take our search for anomalies further by exploring the characteristics of each variable. To keep the report readable, we show only a few variables.

## Country 
##        n  missing distinct 
##      430        0        2 
##                       
## Value      Chile Spain
## Frequency    100   330
## Proportion 0.233 0.767
## Mountain_range 
##        n  missing distinct 
##      430        0        3 
##                                                                          
## Value             Central Andes     Central Pyrenees Sierra de Guadarrama
## Frequency                   100                  135                  195
## Proportion                0.233                0.314                0.453
## Phos_P 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      430        0      268        1    3.477     2.26   0.6777   1.0753 
##      .25      .50      .75      .90      .95 
##   1.9318   3.1503   4.7146   6.2213   7.1411 
## 
## lowest : 0.01980997 0.16225689 0.25041658 0.31268048 0.32430518
## highest: 7.73917664 7.90167428 8.05414937 8.64896403 8.64973041
## Glu_P 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      428        2      272        1    2.102     1.24   0.2986   0.5062 
##      .25      .50      .75      .90      .95 
##   1.3679   2.0922   2.7710   3.2561   3.9541 
## 
## lowest : 0.1074305 0.1110000 0.1140850 0.1150734 0.1619782
## highest: 4.5538748 4.7529319 5.2224173 5.5441612 6.3505287
## NT_P 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      430        0      273        1    3.971    2.688   0.4501   0.7760 
##      .25      .50      .75      .90      .95 
##   2.1575   3.9155   5.6767   7.2597   8.0787 
## 
## lowest :  0.1156901  0.1610857  0.1833210  0.2201596  0.2365415
## highest:  8.4518587  8.9110000  9.0285000 10.8160000 18.0010000

We can observe a large difference between the total number of observations and the number of distinct values for the variables related to the chemical elements. We will try to understand where this difference comes from in the visual analysis part.

Data visualization: Plotting the data


We plot the numerical variables.


Many variables have a distribution that resembles a log-normal distribution.

  • The variables pH_B and pH_T are not normally distributed.
  • Most of the variables are right-skewed.
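One rough way to check a visual impression of right skew is to compare the skewness of a variable before and after a log transform: for approximately log-normal data, the skewness should shrink towards zero. The sketch below is plain Python for illustration (the report's own analysis is in R), uses hypothetical values rather than the real measurements, and relies on the simple moment-based skewness coefficient.

```python
import math

def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical, strictly positive measurements (not taken from the data set).
raw = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 9.0]
logged = [math.log(v) for v in raw]

print(f"skewness of raw values: {skewness(raw):.2f}")
print(f"skewness of log values: {skewness(logged):.2f}")
```

A positive coefficient indicates a longer right tail. A value near zero after the log transform is consistent with, but does not prove, log-normality.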


We then use box plots to detect outliers in the numerical variables and to compare the distributions across the mountain classes.


We can see clear differences between the mountains.

  • Generally speaking, it seems that the mountain “Sierra de Guadarrama” has higher values of Glu_P, Phos_P, SOC_P, Glu_B, Phos_B, SOC_B, Glu_T, Phos_T and SOC_T.
  • There are many outliers in almost all the features.
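Box plots conventionally flag as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles (Tukey's rule). The sketch below applies that rule per group in plain Python, for illustration only. The group names and values are hypothetical, and the quartiles come from Python's default `statistics.quantiles` method, which may differ slightly from the hinges a given plotting library uses.

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements for two groups (not the real data).
groups = {
    "Mountain A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 50],
    "Mountain B": [3, 4, 4, 5, 5, 6, 6, 7, 7, 8],
}

outliers = {name: tukey_outliers(vals) for name, vals in groups.items()}
for name, found in outliers.items():
    print(name, "->", found)
```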


We then plot the categorical variables:


We see that more observations come from Spain, which is expected since two of the three mountains are located in Spain. Almost every locality where samples were taken consists of a sample of 5 subsamples. Some localities, perhaps of greater interest for the study, were sampled several times, but always in multiples of 5 subsamples.


We have more observations for the mountain “Sierra de Guadarrama” (195) than for “Central Andes” (100) and “Central Pyrenees” (135). The differences in the number of observations are large enough that we should be careful with the results and consider balancing the data if needed.

We can comment on the differing numbers of observations and their effect on accuracy. We can also focus more on sensitivity and specificity if needed, since there is about twice as much information on Sierra de Guadarrama as on Central Andes.
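Per-class metrics of this kind can be read off a confusion matrix. The sketch below, in plain Python for illustration, computes one-vs-rest sensitivity (recall) and specificity for each class. The confusion matrix is entirely hypothetical and does not come from any model in this report.

```python
def sensitivity_specificity(confusion, labels):
    """One-vs-rest sensitivity and specificity per class.

    confusion[i][j] = number of cases of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    results = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(row[k] for row in confusion) - tp
        tn = total - tp - fn - fp
        results[label] = (tp / (tp + fn), tn / (tn + fp))
    return results

labels = ["Central Andes", "Central Pyrenees", "Sierra de Guadarrama"]
# Hypothetical confusion matrix (rows: true class, columns: predicted class).
confusion = [
    [15, 3, 2],
    [4, 20, 3],
    [1, 2, 36],
]

metrics = sensitivity_specificity(confusion, labels)
for label, (sens, spec) in metrics.items():
    print(f"{label}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Unlike overall accuracy, these per-class figures make it visible when a model does well mainly on the largest class.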

As described above with the summary output of the data, we have more information on the mountain “Sierra de Guadarrama”: about twice as much as on the mountain “Central Andes”. This may hurt our final result: a model can reach good accuracy simply by tending to predict “Sierra de Guadarrama” more often, without being good enough at predicting a new instance.

We will have to see whether we need to balance our data to get a better model.
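To put numbers on the imbalance, the sketch below (plain Python, for illustration) uses the class counts reported in this section to compute the class proportions and the accuracy of a trivial classifier that always predicts the largest class. The inverse-frequency weights at the end are one common balancing heuristic, shown as an assumption rather than as the method used in this study.

```python
# Class counts as reported in this section.
counts = {
    "Central Andes": 100,
    "Central Pyrenees": 135,
    "Sierra de Guadarrama": 195,
}

total = sum(counts.values())
proportions = {name: n / total for name, n in counts.items()}

# Accuracy obtained by always predicting the most frequent class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# Inverse-frequency class weights: total / (number of classes * class count).
weights = {name: total / (len(counts) * n) for name, n in counts.items()}

for name in counts:
    print(f"{name}: share={proportions[name]:.3f}, weight={weights[name]:.3f}")
print(f"majority baseline ({majority_class}): {baseline_accuracy:.3f}")
```

Any model should be judged against this baseline: an accuracy close to it means the model has learned little beyond the class frequencies.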

We will also inspect possible duplicate observations since, as found previously, some variables contain values that are not all distinct.


We immediately notice how sparse the data are for the samples from Sierra de Guadarrama: removing the duplicates leaves us with a data set of only 274 observations. We will therefore first implement our models keeping the duplicates, bearing in mind that identical values in the training set and the test set will influence the measured accuracy of the model. We will then test our models again on the reduced data set to see whether there is a loss of accuracy.
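The sketch below, in plain Python for illustration, shows how duplicate rows can be counted and how they leak between a training set and a test set. The rows and the split are hypothetical, and the function used in the original analysis is not reproduced here.

```python
from collections import Counter

# Hypothetical observations (class, measurement 1, measurement 2).
rows = [
    ("A", 1.2, 3.4),
    ("A", 1.2, 3.4),   # exact duplicate of the first row
    ("B", 0.7, 2.1),
    ("C", 2.5, 4.0),
    ("C", 2.5, 4.0),   # exact duplicate of the previous row
    ("B", 0.9, 1.8),
]

occurrences = Counter(rows)
unique_rows = list(occurrences)            # keeps first-seen order
n_duplicates = len(rows) - len(unique_rows)
print(f"{len(rows)} rows, {len(unique_rows)} distinct, {n_duplicates} duplicates")

# A naive split: identical rows can land on both sides.
train, test = rows[:4], rows[4:]
leaked = set(train) & set(test)
print("rows present in both train and test:", leaked)
```

Any row appearing on both sides of the split is effectively seen during training, which inflates the measured test accuracy.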

For the rest of the EDA we will continue the analysis on the complete data set.


From the correlation plot, some patterns can be observed. The variables concerning the Phosphatase enzyme seem to be positively correlated with the variables for Soil organic carbon.


With this plot, we see that the Soil organic carbon and Phosphatase enzyme families are indeed strongly positively correlated, with correlation coefficients ranging from 0.739 (SOC_B - Phos_P) to 0.947 (SOC_T - SOC_P).
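For reference, a Pearson correlation coefficient can be computed as in the plain-Python sketch below, given for illustration. It assumes the coefficients quoted above are Pearson correlations, which this section does not state explicitly, and the two short series are hypothetical rather than the real SOC and Phos measurements.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical paired measurements (not taken from the data set).
soc = [1.0, 2.0, 3.0, 4.0, 5.0]
phos = [1.1, 1.9, 3.2, 3.9, 5.3]

r = pearson(soc, phos)
print(f"r = {r:.3f}")
```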

Principal Component Analysis (PCA) exploration:

This analysis helps us to understand the link between the explanatory variables.


The first step is to analyse the data through the covariance matrix, as we did before when we found the positive correlation between Soil organic carbon and Phosphatase enzyme.
The second step is to group the data into principal components.
The third step is to produce a variable factor map to better understand the role of each factor.
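To make the first two steps concrete, the sketch below runs a PCA by hand on just two standardized variables, in plain Python for illustration. With two variables the correlation matrix is [[1, r], [r, 1]], whose eigenvalues are 1 + r and 1 - r with eigenvectors (1, 1)/√2 and (1, -1)/√2, so no linear-algebra library is needed. The data are hypothetical, and this closed form does not generalize to the 25 variables analysed here.

```python
import math
from statistics import mean, stdev

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def pca_two_variables(x, y):
    """PCA on the correlation matrix of two variables (closed form)."""
    zx, zy = standardize(x), standardize(y)
    r = sum(a * b for a, b in zip(zx, zy)) / (len(zx) - 1)
    c = 1 / math.sqrt(2)
    # (eigenvalue, eigenvector) pairs, largest eigenvalue first.
    components = sorted([(1 + r, (c, c)), (1 - r, (c, -c))],
                        key=lambda pair: pair[0], reverse=True)
    eigenvalues = [ev for ev, _ in components]
    # Scores: projection of each standardized observation on each eigenvector.
    scores = [[vx * a + vy * b for _, (vx, vy) in components]
              for a, b in zip(zx, zy)]
    return eigenvalues, scores

# Hypothetical paired measurements (not taken from the data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

eigenvalues, scores = pca_two_variables(x, y)
share = [ev / sum(eigenvalues) for ev in eigenvalues]
print("eigenvalues:", [round(ev, 3) for ev in eigenvalues])
print("proportion of variance:", [round(s, 3) for s in share])
```

The eigenvalues sum to the number of variables (here 2), and each eigenvalue equals the variance of the scores on its component.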


Second step - Principal Component Result:

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.3080 2.2090 1.6387 1.26851 1.02032 0.97230 0.82856
## Proportion of Variance 0.4377 0.1952 0.1074 0.06436 0.04164 0.03781 0.02746
## Cumulative Proportion  0.4377 0.6329 0.7403 0.80469 0.84633 0.88414 0.91161
##                           PC8     PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.7106 0.63948 0.58236 0.5050 0.41430 0.37199 0.32284
## Proportion of Variance 0.0202 0.01636 0.01357 0.0102 0.00687 0.00554 0.00417
## Cumulative Proportion  0.9318 0.94816 0.96173 0.9719 0.97879 0.98433 0.98850
##                           PC15    PC16    PC17    PC18    PC19   PC20    PC21
## Standard deviation     0.25571 0.23393 0.21525 0.17576 0.17150 0.1415 0.12739
## Proportion of Variance 0.00262 0.00219 0.00185 0.00124 0.00118 0.0008 0.00065
## Cumulative Proportion  0.99111 0.99330 0.99515 0.99639 0.99757 0.9984 0.99902
##                          PC22    PC23    PC24    PC25
## Standard deviation     0.1115 0.08012 0.05935 0.04711
## Proportion of Variance 0.0005 0.00026 0.00014 0.00009
## Cumulative Proportion  0.9995 0.99977 0.99991 1.00000
  • We see that the first component explains 43.77% of the overall variation.
  • The second component explains a further 19.52%.
  • With the first 4 principal components, we obtain a cumulative variance of 80.5%, above the 75% threshold.
  • The remaining components (Components 5 to 25) explain 19.5% overall.
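The "Proportion of Variance" and "Cumulative Proportion" rows follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the sum of all variances. The plain-Python sketch below, given for illustration, recomputes them from the 25 standard deviations printed above. Small rounding differences with the printed table are to be expected.

```python
from itertools import accumulate

# Standard deviations of PC1..PC25 as printed in the summary above.
sdev = [
    3.3080, 2.2090, 1.6387, 1.26851, 1.02032, 0.97230, 0.82856,
    0.7106, 0.63948, 0.58236, 0.5050, 0.41430, 0.37199, 0.32284,
    0.25571, 0.23393, 0.21525, 0.17576, 0.17150, 0.1415, 0.12739,
    0.1115, 0.08012, 0.05935, 0.04711,
]

variances = [s ** 2 for s in sdev]
total = sum(variances)
proportion = [v / total for v in variances]
cumulative = list(accumulate(proportion))

print(f"PC1: {proportion[0]:.4f}, PC2: {proportion[1]:.4f}")
print(f"cumulative after 4 components: {cumulative[3]:.4f}")
print(f"share left for components 5 to 25: {sum(proportion[4:]):.4f}")
```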

Here, since the command prcomp does not allow NAs in the data, we use the command na.omit on our reduced data containing the numerical values to omit all cases with NAs from the data frame.
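Applied to a data frame, na.omit performs listwise deletion: every row containing at least one missing value is dropped. The plain-Python sketch below mirrors that behaviour for illustration. The rows are hypothetical, and missing values are assumed to be encoded as `None` or NaN.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing (an assumption about the encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_incomplete_rows(rows):
    """Keep only the rows that contain no missing value (listwise deletion)."""
    return [row for row in rows if not any(is_missing(v) for v in row)]

# Hypothetical numerical rows (not the real data).
rows = [
    [1.2, 3.4, 0.5],
    [0.8, None, 0.7],
    [2.1, 1.9, float("nan")],
    [1.5, 2.2, 0.9],
]

complete = drop_incomplete_rows(rows)
print(f"kept {len(complete)} of {len(rows)} rows")
```

With only 2 incomplete rows out of 430, listwise deletion costs very little here. With many scattered missing values, it could discard a large part of the data.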

For further analysis, we can also study the eigenvalues in order to select a suitable number of components.

Eigenvalue analysis:
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  10.936952114     43.747808455                    43.74781
## Dim.2   4.863450112     19.453800446                    63.20161
## Dim.3   2.694400932     10.777603728                    73.97921
## Dim.4   1.602728782      6.410915127                    80.39013
## Dim.5   1.045432762      4.181731049                    84.57186
## Dim.6   0.954822102      3.819288408                    88.39115
## Dim.7   0.688518236      2.754072944                    91.14522
## Dim.8   0.503178240      2.012712959                    93.15793
## Dim.9   0.411065663      1.644262652                    94.80220
## Dim.10  0.336691388      1.346765554                    96.14896
## Dim.11  0.254731126      1.018924506                    97.16789
## Dim.12  0.172594714      0.690378855                    97.85826
## Dim.13  0.138243331      0.552973325                    98.41124
## Dim.14  0.105400519      0.421602078                    98.83284
## Dim.15  0.067346066      0.269384264                    99.10222
## Dim.16  0.054954619      0.219818475                    99.32204
## Dim.17  0.046558386      0.186233542                    99.50828
## Dim.18  0.031588992      0.126355968                    99.63463
## Dim.19  0.029601829      0.118407318                    99.75304
## Dim.20  0.020191077      0.080764308                    99.83380
## Dim.21  0.016267999      0.065071996                    99.89888
## Dim.22  0.012623924      0.050495696                    99.94937
## Dim.23  0.006886114      0.027544454                    99.97692
## Dim.24  0.003535893      0.014143571                    99.99106
## Dim.25  0.002235080      0.008940321                   100.00000

We obtain the cumulative variance, as before, and also the eigenvalues.

  • A first rule of thumb is to stop adding components when the total variance explained exceeds a high value, for example 80%.
  • Another rule is the Kaiser-Guttman rule, which states that components with an eigenvalue greater than 1 should be retained. The reason is that, with ‘p’ standardized variables, the sum of the eigenvalues is ‘p’, so the average eigenvalue is 1 and a value above 1 is above average.


Therefore, we can consider dimensions 1 to 5:

  • Cumulative variance: 84.57%
  • Eigenvalue > 1
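Both rules of thumb can be checked mechanically against the eigenvalue table. The plain-Python sketch below, given for illustration, uses the 25 eigenvalues printed above. The 80% threshold is the example value mentioned in the text, not a fixed standard.

```python
from itertools import accumulate

# Eigenvalues of Dim.1..Dim.25 as printed in the table above.
eigenvalues = [
    10.936952114, 4.863450112, 2.694400932, 1.602728782, 1.045432762,
    0.954822102, 0.688518236, 0.503178240, 0.411065663, 0.336691388,
    0.254731126, 0.172594714, 0.138243331, 0.105400519, 0.067346066,
    0.054954619, 0.046558386, 0.031588992, 0.029601829, 0.020191077,
    0.016267999, 0.012623924, 0.006886114, 0.003535893, 0.002235080,
]

total = sum(eigenvalues)
cumulative = [c / total for c in accumulate(eigenvalues)]

# Kaiser-Guttman rule: keep the components whose eigenvalue exceeds 1.
kaiser_k = sum(1 for ev in eigenvalues if ev > 1)

# Variance rule: smallest number of components exceeding the threshold.
threshold = 0.80
variance_k = next(i + 1 for i, c in enumerate(cumulative) if c > threshold)

print(f"Kaiser-Guttman rule keeps {kaiser_k} components "
      f"({cumulative[kaiser_k - 1] * 100:.2f}% cumulative variance)")
print(f"80% rule keeps {variance_k} components "
      f"({cumulative[variance_k - 1] * 100:.2f}% cumulative variance)")
```

The two rules need not agree. On these eigenvalues, the 80% rule is already satisfied with 4 components, while the Kaiser-Guttman rule retains 5.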


Scree plot of the eigenvalues.


  • A scree plot represents the value of each eigenvalue.
  • According to the Kaiser-Guttman rule, we should stop at Component 5.


Third step - Variable Factor Map:

The variable factor map shows the variables and organizes them along dimensions. Here the first two dimensions are represented.



Other representation:


Dimension 1 (x-axis) is highly correlated with Phos_T, Phos_B, Phos_P and Glu_T.
Dimension 1 is moderately correlated with PT_B.
Dimension 1 is poorly correlated with Cond_T and Cond_B.
Dimension 2 is well correlated with Cond_T and Cond_B.
Dimension 2 is also moderately negatively correlated with Radiation.
It seems that we have 4 groups of variables playing different roles. On these two dimensions, we notice that the mountain classes already separate into 3 distinct clusters:

  • Indeed, Sierra de Guadarrama is more positively correlated with Dim 1.
  • Central Andes is poorly negatively correlated with Dim 1, but more negatively correlated with Dim 2 (negatively with Cond_T).
  • Central Pyrenees is negatively correlated with Dim 1 and positively correlated with Dim 2.


Study of each variable according to the 5 dimensions:



Dim 1: Highly correlated with the Phos, SOC and Glu variables
Dim 2: Correlated with Cond_P and Cond_T
Dim 3: Correlated with PT_P, PT_B and PT_T
Dim 4: Moderately correlated with K_B and K_T
Dim 5: Correlated with Radiation

Representation of the 5 dimensions:


The squared cosine (cos2) shows the importance of a component for a given observation. It is therefore expected that observations close to the origin are less significant than those far from it. Here we chose to represent only one variable of each type, since the same chemical elements tend to behave the same way regardless of their sampling method. One variable with an interesting behavior is Radiation: the higher the dimension we select, the more important this variable becomes (except for dimension 4), while the importance of the variables related to chemical elements tends to decrease. Thus, we find Radiation strongly correlated with dimension 5.
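As a minimal sketch of how a squared cosine can be obtained, written in plain Python for illustration: it assumes the usual definition for an observation, namely its squared coordinate on a component divided by its squared distance to the origin over all components. The coordinates below are hypothetical, and the exact quantities behind the plot are not reproduced here.

```python
def squared_cosines(coordinates):
    """cos2 of one observation on each component.

    coordinates: the observation's scores on ALL principal components.
    Returns squared score / squared distance to the origin, per component.
    """
    squares = [c * c for c in coordinates]
    distance_sq = sum(squares)
    if distance_sq == 0:
        # An observation exactly at the origin is not represented by any axis.
        return [0.0 for _ in coordinates]
    return [s / distance_sq for s in squares]

# Hypothetical scores of two observations on three components.
well_represented = [3.0, 4.0, 0.0]
poorly_represented = [0.5, 0.5, 2.0]

print(squared_cosines(well_represented))
print(squared_cosines(poorly_represented))
```

For a given observation, the cos2 values over all components sum to 1, so a high value on one component means that component captures most of that observation's position.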

Analysis in 3D

As seen in the EDA, we can consider 5 dimensions. In the following graph, we project the observations from the 3 mountains onto 3 dimensions. Clusters may be apparent.



With this 3D view, we can observe the distribution of the 3 mountains in the PCA space. Later in the analysis, we will perform a cluster analysis to better understand the apparent separation between the mountains.